The first rows of the icecream data (the full data set has 12 observations; only six are shown here):

  Temperature Sales      Pool
1       57.56   215 1.6101279
2       61.52   325 0.0000997
3       53.42   185 0.1348124
4       59.36   332 3.2882103
5       65.30   406 1.9550195
6       71.78   522 0.9825883
What is the equation for the regression line?
\[ \hat{Y_i} = b_0 + b_1 X_{1i} \]
We can calculate estimates of these values using their formulas:
\[ \begin{aligned} b_1 &= \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}\\ b_0 &= \bar{Y} - b_1 \bar{X} \end{aligned} \]
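As a sketch, these formulas can be applied directly in R. Note this uses only the six rows displayed above, not the full 12-row data set, so the estimates will differ somewhat from the −694.37 and 16.72 reported below for the full data:

```r
# Only the six displayed rows (an excerpt of the full icecream data)
temperature <- c(57.56, 61.52, 53.42, 59.36, 65.30, 71.78)
sales       <- c(215, 325, 185, 332, 406, 522)

b1 <- cov(temperature, sales) / var(temperature)  # slope: Cov(X, Y) / Var(X)
b0 <- mean(sales) - b1 * mean(temperature)        # intercept: Ybar - b1 * Xbar
c(intercept = b0, slope = b1)
```

These hand-computed values match what lm() would return for the same rows.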
With these values, we can write out our regression line:
\[ \widehat{Sales}_i = -694.37 + 16.72 \times Temperature_i \]
In R we can use the lm() function (lm stands for “linear model”).
The function arguments are: lm(dependent_variable ~ independent_variable, data = data_name).
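A minimal sketch of that call, assuming the data live in a data frame named icecream with columns Temperature and Sales (here we build one from just the six rows shown above, so the fitted coefficients will differ from those in the full-data output below):

```r
# Illustrative excerpt of the icecream data (six of the twelve rows)
icecream <- data.frame(
  Temperature = c(57.56, 61.52, 53.42, 59.36, 65.30, 71.78),
  Sales       = c(215, 325, 185, 332, 406, 522)
)

# Fit the simple regression of Sales on Temperature
simple_regression <- lm(Sales ~ Temperature, data = icecream)
summary(simple_regression)
```

The object name simple_regression matches the one used later in these notes.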
Let’s take a look at the output using the summary() function:
Call:
lm(formula = Sales ~ Temperature, data = icecream)

Residuals:
    Min      1Q  Median      3Q     Max
-75.512 -12.566   4.133  22.236  49.963

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -694.369    105.048   -6.61 6.00e-05 ***
Temperature   16.715      1.592   10.50 1.02e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.13 on 10 degrees of freedom
Multiple R-squared:  0.9168, Adjusted R-squared:  0.9085
F-statistic: 110.2 on 1 and 10 DF,  p-value: 1.016e-06
Do those match what we calculated before? How do we interpret these values?
In addition to the estimates of the intercept and slope, we have a standard error, test statistic, and p-value.
These are for testing the null hypothesis that the parameter is equal to 0 versus the alternative that it’s not equal to 0.
The test for the intercept is typically not very interesting to us. Instead, we're more interested in the test for the slope, because that tells us whether the predictor is useful in predicting the outcome.
The hypotheses for this test are:
\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0 \]
Based on our summary() output, do we reject or fail to reject the null hypothesis? What does that tell us?
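The test statistic and two-sided p-value for the slope can be reproduced by hand from the summary() output. A sketch, where the values 16.715, 1.592, and df = 10 are taken from the output above:

```r
b1    <- 16.715  # slope estimate from summary()
se_b1 <- 1.592   # its standard error
df    <- 10      # residual degrees of freedom (N - 2 = 12 - 2)

t_stat  <- b1 / se_b1                                   # t value
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # two-sided p-value
c(t = t_stat, p = p_value)
```

The results agree with the t value of 10.50 and p-value of 1.02e-06 in the output.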
We could also create a confidence interval around our estimate of the slope:
\[ \mathrm{CI}_{95\%} = \hat{b_1} \pm t_{crit} \times \mathrm{SE} \]
\[ \mathrm{SE}_{b_1} = \frac{s_\varepsilon}{\sqrt{SS_X}} \]
\[ s_\varepsilon = \sqrt{\frac{1}{N-2}\sum_{i=1}^N (Y_i-\hat{Y_i})^2} \]
sd_error <- sqrt((1 / (nrow(icecream) - 2)) * sum((icecream$Sales - predict(simple_regression))^2))
sd_error
[1] 38.1265
predict() is a generic R function for obtaining predictions from the results of various model-fitting functions.
\[ SS_X = \sum_{i=1}^N (X_i - \bar{X})^2 \]
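Putting the two pieces together, the standard error of the slope is \(s_\varepsilon / \sqrt{SS_X}\). A sketch of that computation, again using only the six displayed rows (so the result will not match the 1.592 from the full 12-row data):

```r
temperature <- c(57.56, 61.52, 53.42, 59.36, 65.30, 71.78)
sales       <- c(215, 325, 185, 332, 406, 522)

fit   <- lm(sales ~ temperature)
s_eps <- sqrt(sum((sales - predict(fit))^2) / (length(sales) - 2))  # residual SD
ss_x  <- sum((temperature - mean(temperature))^2)                   # SS_X
se_b1 <- s_eps / sqrt(ss_x)                                         # SE of the slope
se_b1
```

This hand computation reproduces the standard error that summary() reports for the same data.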
The standard error for the slope is given in the output when we call the summary() function on our model. Take a look at it again and try to find it!
To get the 95% CI we have to compute \(\pm \ t_{crit} \times \mathrm{SE}\), where \(t_{crit}\) is the critical value that cuts off the upper 2.5% of a t-distribution with \(N - 2\) df.
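A sketch of that calculation, plugging in the slope estimate, standard error, and df = 10 from the summary() output above:

```r
b1     <- 16.715              # slope estimate from summary()
se_b1  <- 1.592               # its standard error
t_crit <- qt(0.975, df = 10)  # cuts off the upper 2.5% with N - 2 = 10 df

ci <- b1 + c(-1, 1) * t_crit * se_b1
ci
```

The resulting interval is approximately [13.17, 20.26], matching the confint() output.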
We can also obtain the 95% CI using the confint() function:
                 2.5 %     97.5 %
(Intercept) -928.4318 -460.30714
Temperature   13.1679   20.26305
R-squared is the proportion of total variability in our outcome that is explained by our predictor(s).
It is also given in the output of the summary() function:
What does that mean?
You’ll notice that there is also an adjusted \(R^2\).
This is more important in the context of multiple regression because as you add more predictors to the model, your \(R^2\) will typically increase even if the added predictors are just explaining noise.
Therefore, adjusted \(R^2\) corrects for the number of predictors in the model.
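Both versions can be extracted directly from the summary object. A sketch, again rebuilding the model from the six displayed rows (so the values will differ somewhat from the 0.9168 and 0.9085 reported above):

```r
# Excerpt of the icecream data (six of the twelve rows)
icecream <- data.frame(
  Temperature = c(57.56, 61.52, 53.42, 59.36, 65.30, 71.78),
  Sales       = c(215, 325, 185, 332, 406, 522)
)
fit <- summary(lm(Sales ~ Temperature, data = icecream))

fit$r.squared      # Multiple R-squared
fit$adj.r.squared  # Adjusted R-squared
```

Adjusted \(R^2\) is always at most the unadjusted value, and the gap grows as predictors are added.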
We ran a simple linear regression to determine whether temperature predicts ice cream sales. The effect of Temperature is statistically significant and positive (b = 16.72, 95% CI [13.17, 20.26], t(10) = 10.50, p < .001). Temperature explained 91.68% of the total variability in ice cream sales.
PSC 103B - Statistical Analysis of Psychological Data